Higher-Order Markov Tag-Topic Models for
Tagged Documents and Images

Jia Zeng, Wei Feng, William K. Cheung and Chun-Hung Li 

Abstract — This paper studies the topic modeling problem of tagged documents and images. Higher-order relations among tagged 
documents and images are major and ubiquitous characteristics, and play positive roles in extracting reliable and interpretable topics. 
In this paper, we propose the tag-topic models (TTM) to depict such higher-order topic structural dependencies within the Markov 
random field (MRF) framework. First, we use the novel factor graph representation of latent Dirichlet allocation (LDA)-based topic 
models from the MRF perspective, and present an efficient loopy belief propagation (BP) algorithm for approximate inference and 
parameter estimation. Second, we propose the factor hypergraph representation of TTM, and focus on both pairwise and higher-order 
relation modeling among tagged documents and images. An efficient loopy BP algorithm is developed to learn TTM, which encourages the topic labeling smoothness among tagged documents and images. Extensive experimental results confirm that incorporating higher-order relations effectively enhances the overall topic modeling performance, when compared with current state-of-the-art topic models, in many text and image mining tasks of broad interest such as word and link prediction, document classification, and tag recommendation.

Index Terms — Topic models, Latent Dirichlet allocation, Markov random fields, Bayesian networks, factor graph, hypergraph, higher- 
order relation, tagged documents and images, belief propagation, message passing, hierarchical Bayesian models. 



1 Introduction 

The goal of this work is to model and infer semantically meaningful word clusters, referred to as topics, from large-scale tagged documents and images. In a broad sense, we define a tag as a label that characterizes certain properties of documents and images. For example, the "author" tag identifies the authorship of a document, and the "time stamp" tag marks when the document was published. On the other hand, we can treat images as documents composed of visual words. Users often manually annotate images with semantic tags such as "building" or "tree" to label local contents or objects of interest. Generally, one document may be associated with multiple tags, and one tag may be attached to multiple documents. Fig. 1A illustrates an example of tagged documents with the tags being authors, where a link denotes that the author writes the document. Fig. 1B shows another example of tagged images, where four images are annotated with three tags "sky", "building", and "people". We can conveniently represent tagged documents and images by the bipartite graph in Fig. 1C, which is composed of the tag nodes $\{\gamma\}$ and the document or image nodes $\{\delta\}$ connected by links. In the bipartite graph, tags often connect multiple documents or images and thereby build higher-order relations among documents or images.

Besides the pairwise relations, the higher-order relations among tagged documents and images formed by multiple tags are major and ubiquitous characteristics. For example, the authors $\{\gamma_1, \gamma_2\}$ collaborate to write the document $\delta_1$ on the topic "Machine Learning", denoted by the intersection of the two circles $\{\gamma_1, \gamma_2\}$ in Fig. 1D. Similarly, $\{\gamma_1, \gamma_3\}$ collaborate to write $\delta_3$ on the topic "Computer Vision", and $\{\gamma_2, \gamma_3\}$ collaborate to write $\delta_4$ on the topic "Data Mining". These three authors $\{\gamma_1, \gamma_2, \gamma_3\}$ also jointly collaborate to write the document $\delta_2$, denoted by the intersection of the three circles $\{\gamma_1, \gamma_2, \gamma_3\}$ in Fig. 1D. If we simply decompose the higher-order relation $\{\gamma_1, \gamma_2, \gamma_3\}$ into three pairwise relations $\{\{\gamma_1, \gamma_2\}, \{\gamma_1, \gamma_3\}, \{\gamma_2, \gamma_3\}\}$, we may just come to the conclusion that $\delta_2$ focuses on the combined topics of "Machine Learning", "Computer Vision", and "Data Mining". Nevertheless, the possibility that $\delta_2$ is in fact modeling a totally new topic like "Computational Biology" in the intersection of the three circles will then be excluded, as shown in Fig. 1D. Obviously, $\delta_2$ lies in the specific subset of $\{\gamma_1, \gamma_2, \gamma_3\}$, which is quite different from the union of subsets $\{\{\gamma_1, \gamma_2\}, \{\gamma_1, \gamma_3\}, \{\gamma_2, \gamma_3\}\}$. So, explicit modeling of the higher-order relation among documents constituted by multiple tags $\{\gamma_1, \gamma_2, \gamma_3\}$ is needed to distinguish the specific topic of $\delta_2$ from the combined topics of $\delta_1$, $\delta_3$, and $\delta_4$. Furthermore, modeling such higher-order relations also reflects the fact that the tags $\{\gamma_1, \gamma_2, \gamma_3\}$ are often attached to the document $\delta_2$ jointly rather than separately to explain the document content. Similar higher-order relations among images induced by



Fig. 1. Examples of tagged documents and images: (A) the tagged documents with the tags being authors, (B) the tagged images with the tags being annotations, (C) the bipartite graph representation, and (D) the higher-order relation distinguishes the specific topic of $\delta_2$ from the combined topics of $\delta_1$, $\delta_3$, and $\delta_4$.



multiple tags can also be found in Fig. 1B.
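The difference between a higher-order relation and its pairwise decomposition can be illustrated with a small sketch. The tag and document names below (`g1`..`g3`, `d1`..`d4`) and both helper functions are hypothetical stand-ins for the author example discussed above; they are not part of the proposed model.

```python
from collections import defaultdict

# Hypothetical toy data following the author example: each document lists its tags.
doc_tags = {
    "d1": {"g1", "g2"},
    "d2": {"g1", "g2", "g3"},
    "d3": {"g1", "g3"},
    "d4": {"g2", "g3"},
}


def tag_to_docs(doc_tags):
    """Invert the document -> tags map into tag -> documents (the bipartite links)."""
    index = defaultdict(set)
    for doc, tags in doc_tags.items():
        for tag in tags:
            index[tag].add(doc)
    return dict(index)


def docs_with_exact_tags(doc_tags, tags):
    """Documents attached to exactly this tag set (one higher-order relation)."""
    wanted = frozenset(tags)
    return sorted(d for d, ts in doc_tags.items() if frozenset(ts) == wanted)


links = tag_to_docs(doc_tags)
joint = docs_with_exact_tags(doc_tags, {"g1", "g2", "g3"})
pairwise = docs_with_exact_tags(doc_tags, {"g1", "g2"})
```

In this toy data only `d2` carries all three tags jointly, whereas the pair `{g1, g2}` alone identifies `d1`; treating the triple as three independent pairs would lose that distinction.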

However, prior efforts at pairwise relation modeling in topic models rarely consider higher-order relations that may encode specific topic structural dependencies among tagged documents and images. Therefore, in this paper, we propose the tag-topic models (TTM) to describe such higher-order topic structural dependencies within the Markov random field (MRF) framework. This approach extends our previous work in modeling higher-order relations of coauthors [12] to the more generic tagged documents and images, allowing us to develop more efficient inference and parameter estimation algorithms within the theoretically well-founded MRF framework.

First, we reformulate the topic modeling task as a labeling problem from the novel MRF perspective. We represent the latent Dirichlet allocation (LDA) [1]-based topic models as factor graphs [13], [14], and develop the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Second, we represent TTM using the factor hypergraph [15] according to the bipartite graph in Fig. 1C, and focus on both pairwise and higher-order relation modeling within the higher-order MRF framework. Indeed, such higher-order MRF has recently found important applications in modeling high-level image structural priors in many computer vision problems, including image restoration, disparity estimation and object segmentation [16], [17].

Generally, inference in the higher-order MRF is intrinsically computationally expensive, since even encoding $M$-order topic structural dependencies of $J$ topics requires $J^M$ labeling configurations. However, similar to image structural priors, higher-order relations used in topic modeling also have certain properties such as smoothness or sparsity [17]-[19], which make them easy to handle. Intuitively, the co-tagged documents and images tend to have a higher likelihood of sharing a similar topic labeling configuration. Based on the smoothness or sparsity prior, many higher-order topic labeling configurations are equally unlikely and thus need not be encouraged. Therefore, we encourage only a total of $JM$ smooth topic labeling configurations in TTM, which avoids encoding $J^M$ arbitrary topic structural dependencies. To this end, we design the higher-order functions to encode the major and representative smoothness relations, and develop the loopy BP algorithm [14], [16] for efficient inference and parameter estimation of TTM.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents MRF for topic modeling and develops loopy BP algorithms for approximate inference and parameter estimation. Section 4 proposes TTM and focuses on pairwise and higher-order relation modeling among tagged documents and images. Section 5 shows extensive experimental results on several challenging text and image mining tasks of broad interest. Finally, Section 6 draws conclusions and envisions future work.

2 Related Work 

Probabilistic topic models are a state-of-the-art approach to text mining tasks such as learning terminological ontologies [22]. LDA [1] is the basic topic model in Fig. 2A. It allocates a topic label $z$ to each word $w$ in the document $d$ based on the document-specific topic proportion $\theta_d$ and the topic-specific multinomial distribution $\phi$ over vocabulary words, where $\theta_d$ and $\phi$ are smoothed by two conjugate Dirichlet hyperparameters $\alpha$ and $\beta$, respectively. The plates indicate replication. For example, the document repeats $D$ times in the corpus, the word repeats $I_d$ times in the document $d$, and there are a total of $J$ topics. LDA builds implicit links between two documents $\{d, d'\}$ by sharing the same topic distribution $\phi$, and it encourages similar topic labeling configurations if two documents contain similar words. However, LDA uses only $\phi$ to exchange topic information among documents, and ignores the rich link information like citations or hyperlinks between documents. This motivates the recent variants of LDA that regularize the topic distribution $\phi$ through pairwise relations between documents.
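The generative process of LDA described above can be sketched in a few lines. This is only an illustrative sampler under stated assumptions: symmetric Dirichlet priors, a fixed document length `doc_len`, and hypothetical function names; it is not the inference procedure studied in this paper.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(D, W, J, doc_len, alpha, beta, seed=0):
    """Generate D documents of (word index, topic label) pairs from LDA."""
    rng = random.Random(seed)
    # One multinomial over the W vocabulary words per topic, smoothed by beta.
    phi = [sample_dirichlet(beta, W, rng) for _ in range(J)]
    corpus = []
    for _ in range(D):
        # Document-specific topic proportions, smoothed by alpha.
        theta = sample_dirichlet(alpha, J, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # topic label for this token
            w = sample_index(phi[z], rng)  # word drawn from the chosen topic
            doc.append((w, z))
        corpus.append(doc)
    return corpus


corpus = generate_corpus(D=3, W=20, J=4, doc_len=10, alpha=0.5, beta=0.5)
```

Documents interact only through the shared `phi`, which is the implicit link between documents mentioned above.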

Pairwise topic models focus on the link generation process, which in turn influences the topic allocation for words. Link LDA [2] uses the document-specific topic proportion and the topic-specific distribution over documents to generate a cited document by the citing document. When two documents cite the same document, they tend to have similar topic labeling configurations over words. In this sense, link LDA indirectly depicts a co-citation link between documents citing the same other documents, and scales badly to large-scale corpora because the number of its parameters increases with the total number of documents. To overcome these weaknesses, pairwise LDA [3] directly generates the binary citation link variable between two documents using a topic-dependent Bernoulli distribution. But it randomly uses one of the topic labels rather than all topic labels in the document to generate links, which significantly limits the influence of link information on topic regularization. To relax this limitation, the relational topic model (RTM) (Fig. 2B) [4] represents the entire document topics by the mean value of the document topic proportions. It then uses the Hadamard product of mean values $\bar{z}_d \circ \bar{z}_{d'}$ from two linking documents $\{d, d'\}$ as link features, which are learned by the generalized linear model $\eta$ to generate the observed citation link variable $t = 1$. If the citation link variables are replaced by the observed tags, RTM can be adapted to account for the tagged documents and images. Similar to the basic idea of RTM, the latent topic hypertext model [5] assumes that links originate from words and uses partial word topic labels to generate links.

Furthermore, topic-link LDA [6], multirelational topic models [7], and Markov random topic fields [8] simultaneously generate multiple types of links such as citations, coauthor relations, and social communities of authors to improve the accuracy of topic modeling. The citation influence model [9] allows the topic of a citing document to depend on either its own topic proportions or its cited documents' topic proportions. Topic modeling with network regularization [10] adopts a graph-based regularizer to encourage the minimum Euclidean distance between document-layer topic labeling configurations. Markov topic models [11] use the Gaussian Markov random field to describe topic interactions among documents in different conferences. Nevertheless, all recent pairwise topic models have limited expressive power, because they are insufficient to depict higher-order relations among tagged documents and images.

Author-topic models (ATM) [20] (Fig. 2C) and labeled LDA (L-LDA) [21] (Fig. 2D) are able to associate observed tags with words directly. ATM uses a document-specific uniform distribution $u_d$ to generate a tag, and further uses the tag-specific topic proportions $\theta_t$ to generate a topic label for the word. The plate on $\theta$ indicates that there are $T$ unique tags. All documents with the tag $t$ will share the same tag-specific topic proportions $\theta_t$, which implicitly encodes the pairwise relation of documents associated with the tag $t$. L-LDA constrains latent topics to be observed tags generated from the document-specific topic proportions $\theta_d$ over tags, where each tag is further associated with a multinomial distribution $\phi_t$ to generate words. In this sense, L-LDA is a supervised topic model because it replaces latent topic labels with observed tags. All documents with the tag $t$ will share the same multinomial distribution $\phi_t$, which also encodes statistical information of documents associated with the tag $t$. However, the higher-order relations among documents and images due to multiple connected tags have been largely neglected in both ATM and L-LDA, which motivates us to explore a more specific higher-order TTM in this study.

3 MRF for Topic Modeling 

3.1 The Labeling Problem 

Table 1 summarizes some important notations in this paper. From a new perspective, this subsection formulates topic modeling as the labeling problem within the MRF framework. The objective of topic modeling is to assign a set of semantic topic labels $\mathbf{z} = \{z_{w,d}\}$ to explain the observed bag of words $\mathbf{w} = \{w, d\}$, where $1 \le w \le W$ is the word index in the vocabulary and $1 \le d \le D$ is the document index in the corpus. Generally, the topic label $z_{w,d}$ takes one of the topic indices $1 \le j \le J$ and partitions all words into $J$ topic groups, so that the topic modeling technique is often viewed as
TABLE 1
Notations

$1 \le d \le D$                                      Document index
$1 \le w \le W$                                      Vocabulary word index
$1 \le t \le T$                                      Tag index
$1 \le j \le J$                                      Topic index
$\mathbf{w} = \{w, d\}$                              Bag of words
$\mathbf{z} = \{z_{w,d}\}$                           Topic labels for words
$z_{-w,d}$                                           Labels for $d$ excluding $w$
$z_{w,-d}$                                           Labels for $w$ excluding $d$
$\theta_d$                                           Factor of document $d$
$\phi_w$                                             Factor of word $w$
$\gamma_t$                                           Factor of tag $t$
$\delta_d$                                           Factor hyperedge
$\mu(z_{w,d})$                                       Topic messages
$\mu(z_{\cdot,d})$ and $\mu(z_{w,\cdot})$            $\sum_w \mu(z_{w,d})$ and $\sum_d \mu(z_{w,d})$
$f(\cdot)$                                           Factor functions
$\alpha$, $\beta$                                    Dirichlet hyperparameters



one of the word clustering paradigms. In theory, MRF solves the labeling problem by assigning the best topic labels according to the maximum a posteriori (MAP) estimation, and this MRF-MAP framework has found many important applications in image analysis and computer vision [16]. More specifically, MRF attempts to find the best topic labeling configuration $\mathbf{z}$ over words $\mathbf{w}$ through maximizing the posterior probability $p(\mathbf{z}|\mathbf{w})$, which is by nature a prohibitively expensive combinatorial optimization problem in the discrete latent topic space. To avoid the high computational cost, MRF often uses the smoothness or sparsity property of the labeling problem to reduce the total number of possible labeling configurations [17]-[19].

As far as topic modeling is concerned, we only encourage smoothness of neighboring topic labels, i.e., the neighboring topic labels tend to be the same. To this end, we define the neighborhood system of the topic label $z_{w,d}$ as $z_{-w,d}$ and $z_{w,-d}$, where $z_{-w,d}$ denotes the set of topic labels associated with all word indices in the document $d$ excluding the word index $w$, and $z_{w,-d}$ denotes the set of topic labels associated with the word index $w$ in all documents excluding $d$. Furthermore, we use the factor graph [13], [14, Chapter 8.4.3] to represent LDA, and treat the parameters $\theta$ and $\phi$ as factors with parameterized functions [14]. By designing the proper factor functions, which are equivalent to clique potentials, we can encourage or penalize different local labeling configurations in the neighborhood system. More specifically, we encourage the topic labeling smoothness among $\{z_{w,d}, z_{-w,d}, z_{w,-d}\}$.

In this paper, we consider a type of LDA with fixed symmetric Dirichlet hyperparameters $\alpha$ and $\beta$ [23] in order to avoid the complex full Bayesian inference of $\alpha$ and $\beta$. We transform the generative graphical representation of LDA in Fig. 2A to the factor graph [13], [14] in Fig. 3A from the MRF perspective. We illustrate the factors $\theta_d$ and $\phi_w$ as squares, and denote their connected variables $z_{w,d}$ as circles. Obviously, the factors $\theta_d$ and $\phi_w$ connect the neighboring topic labeling configurations $\{z_{w,d}, z_{-w,d}, z_{w,-d}\}$. In this way, the hierarchically directed graphical model of LDA in Fig. 2A becomes a more generic undirected graphical model in Fig. 3A. We absorb the observed word index $w$ as the index of the factor $\phi_w$, which is similar to absorbing the observed document index $d$ as the index of the factor $\theta_d$ in Fig. 2A. Because the factors can be parameterized functions [14], both $\theta_d$ and $\phi_w$ can be the same multinomial functions smoothed by the Dirichlet priors defined in LDA [23]. Also, both hyperparameters can be viewed as pseudo-counts in estimating the corresponding multinomial distributions. This resembles the collapsed Gibbs sampling (GS) [23], which integrates out the parameter variables $\theta$ and $\phi$ and treats the hyperparameters $\alpha$ and $\beta$ as pseudo topic counts in order to perform inference on the collapsed hidden variable space $\mathbf{z}$.

Recently, LDA has been reformulated as a Bayesian network [24], which is one of the constrained undirected graphical models (MRF) with causal dependencies between hidden variables. Indeed, Fig. 2A and Fig. 3A reflect two facets of LDA: the former focuses more on the hierarchical generative process of the observed words, while the latter places more emphasis on the topic labeling smoothness within the MRF framework.

The original factor graph representation for MRF [13] can be naturally extended to describe the generative process of a probabilistic model. For example, one extension is the directed factor graph [25] that enhances the visual language to represent LDA. Because the topic modeling task can be formulated as the labeling problem from the MRF perspective, the original undirected factor graph has enough expressive power to represent LDA directly. Although the undirected graph does not explicitly emphasize the generative process as the directed counterpart does [25], it still captures the underlying structural dependencies of hidden variables without loss of information. In this sense, the factor graph may be a more generic visual representation for both directed and undirected graphical models in various real-world applications.

Although the factor graph in Fig. 3A is slightly different from the directed graphical model in Fig. 2A, it can fulfill the same topic modeling task using the specific neighborhood systems and factor functions. First, both Figs. 2A and 3A have the same neighborhood system because the connection of hidden variables remains the same. Second, in the next subsection, we shall design specific factor functions to realize the same topic modeling goal as Fig. 2A without loss of information.

3.2 Inference and Parameter Estimation 

The loopy BP [14, Chapter 8] algorithms such as the sum-product and max-sum algorithms provide efficient and approximate solutions to inference problems for graphs with loops such as the one in Fig. 3A. Rather than directly calculating the posterior probability $p(\mathbf{z}|\mathbf{w})$, we turn to calculating the posterior marginal probability $p(z_{w,d})$, referred to as the message $\mu(z_{w,d})$, which can be normalized efficiently using a local computation. Message passing proceeds
Fig. 3. (A) Factor graph for LDA. (B) Passing messages from the factors $\theta_d$ and $\phi_w$ to the variable $z_{w,d}$. (C) and (D) Passing messages from the neighboring variables $z_{-w,d}$ and $z_{w,-d}$ to factors $\theta_d$ and $\phi_w$, respectively. Arrows show the directions of message passing.



from variables to factors, and in turn from factors to variables until convergence after several iterations. In this subsection, we adopt the sum-product algorithm to infer the marginal posterior probability $\mu(z_{w,d})$.

The message passing scheme is an instantiation of the E-step of the expectation-maximization (EM) algorithm [14], which has been widely used to infer the marginal probabilities of hidden variables in various graphical models according to the maximum-likelihood estimation. For example, the E-step inference for Gaussian mixture models (GMM) [26], the forward-backward algorithm for hidden Markov models (HMM) [27], and the probabilistic relaxation labeling algorithm for MRF [28] can all be formulated within the message passing framework [14], [29], [30]. After the E-step, we can estimate the parameters $\theta$ and $\phi$ based on the inferred marginal probabilities $\mu(z_{w,d})$ at the M-step of the EM algorithm, which is almost the same as in the EM algorithms for learning other finite mixture models like GMM and GMM-based HMM. More details on learning finite mixture models using the EM algorithm can be found in the book [14].

Fig. 3B shows the message passing from two factors to the variable. The message $\mu(z_{w,d})$ is the product of both input messages,

$$\mu(z_{w,d}) \propto \mu_{\theta_d \to z_{w,d}}(z_{w,d}) \times \mu_{\phi_w \to z_{w,d}}(z_{w,d}), \tag{1}$$

where we use the arrows to denote the message passing directions. The normalized message $\mu(z_{w,d})$ is in turn passed back to the factors. In Figs. 3C and 3D, the messages from factors to variables can be further calculated based on all input messages from neighboring variables as follows,

$$\mu_{\theta_d \to z_{w,d}}(z_{w,d}) = \sum_{z_{-w,d}} f_{\theta_d} \prod_{-w} \mu(z_{-w,d})\, \alpha, \tag{2}$$

$$\mu_{\phi_w \to z_{w,d}}(z_{w,d}) = \sum_{z_{w,-d}} f_{\phi_w} \prod_{-d} \mu(z_{w,-d})\, \beta, \tag{3}$$

where $z_{-w,d}$ and $z_{w,-d}$ represent all possible neighboring labeling configurations of $z_{w,d}$, and $f(\cdot)$ is the factor function that evaluates the topic structural dependencies of input topic messages. The topic labeling smoothness prior implies that only $J$ topic configurations are encouraged in (2) and (3). Thus, we can rewrite (2) and (3) as

$$\mu_{\theta_d \to z_{w,d}}(z_{w,d} = j) = f_{\theta_d} \prod_{-w} \mu(z_{-w,d} = j)\, \alpha, \tag{4}$$

$$\mu_{\phi_w \to z_{w,d}}(z_{w,d} = j) = f_{\phi_w} \prod_{-d} \mu(z_{w,-d} = j)\, \beta. \tag{5}$$

In practice, Eqs. (4) and (5) often cause the product of multiple input messages to be close to zero [31]. To avoid arithmetic underflow, we approximate the product of messages by the sum of messages because the product value increases when the sum value increases,

$$\prod_{-w} \mu(z_{-w,d})\, \alpha \propto \sum_{-w} \mu(z_{-w,d}) + \alpha, \tag{6}$$

$$\prod_{-d} \mu(z_{w,-d})\, \beta \propto \sum_{-d} \mu(z_{w,-d}) + \beta. \tag{7}$$

Such approximations as (6) and (7) transform the standard sum-product algorithm to the sum-sum algorithm, which is still good at passing messages in MRF with acceptable performance [30], [31].
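The underflow that motivates replacing the product of messages by their sum can be reproduced with ordinary double-precision arithmetic. The following sketch uses hypothetical helper names and made-up message values; it only illustrates the numerical issue, not the full algorithm.

```python
import math


def product_of_messages(messages, alpha):
    """Product form: multiply all neighboring messages, then the hyperparameter."""
    return math.prod(messages) * alpha


def sum_of_messages(messages, alpha):
    """Sum form used by the approximation: add all messages and the hyperparameter."""
    return sum(messages) + alpha


# 200 neighboring messages of 1e-3 each: the true product is 1e-600,
# which is below the smallest positive double and is rounded to 0.0.
messages = [1e-3] * 200
prod_value = product_of_messages(messages, alpha=0.1)
sum_value = sum_of_messages(messages, alpha=0.1)
```

Once the product collapses to zero for every topic, the message can no longer be normalized, whereas the sum form stays well inside the representable range.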

For convenience, we use the shorthand notations $\mu(z_{\cdot,d}) = \sum_w \mu(z_{w,d})$, $\mu(z_{w,\cdot}) = \sum_d \mu(z_{w,d})$, $\mu(z_{-w,d}) = \sum_{-w} \mu(z_{-w,d})$, and $\mu(z_{w,-d}) = \sum_{-d} \mu(z_{w,-d})$ in the subsequent formulas. In MRF [14], the factor functions $f(\cdot)$ correspond to the clique potentials, which can be designed arbitrarily to encode our prior knowledge on encouraging or penalizing topic labeling configurations. Indeed, a higher value of $f(\cdot)$ encourages passing more neighboring messages. Here, we design $f_{\theta_d}$ and $f_{\phi_w}$ as

$$f_{\theta_d} = \frac{1}{\sum_j [\mu(z_{-w,d} = j) + \alpha]}, \tag{8}$$

$$f_{\phi_w} = \frac{1}{\sum_w [\mu(z_{w,-d} = j) + \beta]}. \tag{9}$$

Eq. (8) normalizes the input messages by the sum of messages over all topics in the document $d$ in order to make output messages comparable across different documents. Eq. (9) normalizes the input messages by the sum of messages over all word indices in the vocabulary in order to make output messages comparable across different vocabulary words.

Combining (1) to (9) yields the complete message update equation,

$$\mu(z_{w,d} = j) \propto \frac{\mu(z_{-w,d} = j) + \alpha}{\sum_j [\mu(z_{-w,d} = j) + \alpha]} \times \frac{\mu(z_{w,-d} = j) + \beta}{\sum_w [\mu(z_{w,-d} = j) + \beta]}, \tag{10}$$



where the notations $-w$ and $-d$ denote all word indices except $w$ and all document indices except $d$, and the notations $\mu(z_{-w,d})$ and $\mu(z_{w,-d})$ represent the sum of all possible neighboring messages excluding the current message $\mu(z_{w,d})$. We normalize the updated message locally in terms of $j$ so that $\sum_j \mu(z_{w,d} = j) = 1$.

In practice, the messages converge in the factor graph shown in Fig. 3A after a finite number $N$ of iterations. BP usually converges fast with $N \le 500$. Note that we need to multiply the corresponding word topic message by the word count or the relative word frequency during message passing and parameter estimation.
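A minimal sketch of the resulting message passing loop is given below, assuming an asynchronous update schedule, random initialization, and a corpus stored as word-count dictionaries; the function name `bp_lda` and this input format are illustrative choices, not prescribed by the paper. The word-side normalizer follows a literal reading of the update: it sums the messages of all vocabulary words over all documents other than $d$ and adds $W\beta$. The document-side normalizer is constant in $j$ and therefore cancels in the local normalization.

```python
import random


def bp_lda(docs, W, J, alpha, beta, n_iter=50, seed=0):
    """Iterate the message update on a toy corpus.

    docs: list of dicts mapping word index -> count (hypothetical format).
    Returns mu, where mu[d][w] is a length-J list that sums to one.
    """
    rng = random.Random(seed)
    mu = []
    for doc in docs:
        messages = {}
        for w in doc:
            raw = [rng.random() + 1e-3 for _ in range(J)]
            total = sum(raw)
            messages[w] = [r / total for r in raw]
        mu.append(messages)

    # Count-weighted message sums per document, per word, and per topic.
    doc_sum = [[0.0] * J for _ in docs]
    word_sum = [[0.0] * J for _ in range(W)]
    topic_sum = [0.0] * J
    for d, doc in enumerate(docs):
        for w, count in doc.items():
            for j in range(J):
                weighted = count * mu[d][w][j]
                doc_sum[d][j] += weighted
                word_sum[w][j] += weighted
                topic_sum[j] += weighted

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                old = mu[d][w]
                new = []
                for j in range(J):
                    own = count * old[j]
                    doc_part = doc_sum[d][j] - own + alpha    # mu(z_{-w,d}=j) + alpha
                    word_part = word_sum[w][j] - own + beta   # mu(z_{w,-d}=j) + beta
                    # Sum over all words of the messages from documents other than d.
                    norm = topic_sum[j] - doc_sum[d][j] + W * beta
                    new.append(doc_part * word_part / norm)
                total = sum(new)
                new = [x / total for x in new]                # local normalization over j
                for j in range(J):
                    delta = count * (new[j] - old[j])
                    doc_sum[d][j] += delta
                    word_sum[w][j] += delta
                    topic_sum[j] += delta
                mu[d][w] = new
    return mu


docs = [{0: 3, 1: 2}, {0: 1, 1: 4}, {2: 3, 3: 2}, {2: 2, 3: 3}]
mu = bp_lda(docs, W=4, J=2, alpha=0.1, beta=0.01, n_iter=30)
```

Each message is stored once per distinct word index in a document and weighted by its count, which is how the word counts mentioned above enter the update.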

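Given the inferred marginal posterior probability $\mu(z_{w,d})$, the parameter estimation of $\theta$ and $\phi$ can be performed simply using (4) and (5) (Figs. 3C and 3D) by adding all input messages including $\mu(z_{w,d})$ evaluated by the corresponding factor functions,

$$\theta_d(j) = \frac{\mu(z_{\cdot,d} = j) + \alpha}{\sum_j [\mu(z_{\cdot,d} = j) + \alpha]}, \tag{11}$$

$$\phi_w(j) = \frac{\mu(z_{w,\cdot} = j) + \beta}{\sum_w [\mu(z_{w,\cdot} = j) + \beta]}. \tag{12}$$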
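The parameter estimation step can be sketched as follows, assuming that $\theta_d(j)$ is the count-weighted sum of all messages in document $d$ for topic $j$, smoothed by $\alpha$ and normalized over topics, and that $\phi_w(j)$ is the corresponding sum over documents for word $w$, smoothed by $\beta$ and normalized over the vocabulary. The function name and the dictionary-based corpus format are illustrative assumptions.

```python
def estimate_parameters(docs, mu, W, J, alpha, beta):
    """Estimate theta (documents x topics) and phi (topics x words) from messages.

    docs: list of dicts mapping word index -> count (hypothetical format).
    mu:   mu[d][w] is the length-J message of word w in document d.
    """
    theta = []
    for d, doc in enumerate(docs):
        row = [sum(c * mu[d][w][j] for w, c in doc.items()) + alpha
               for j in range(J)]
        total = sum(row)
        theta.append([x / total for x in row])  # normalize over topics

    phi = [[beta] * W for _ in range(J)]
    for d, doc in enumerate(docs):
        for w, c in doc.items():
            for j in range(J):
                phi[j][w] += c * mu[d][w][j]
    for j in range(J):
        total = sum(phi[j])
        phi[j] = [x / total for x in phi[j]]    # normalize over vocabulary words
    return theta, phi


# One document with word 0 (count 2) fully in topic 0 and word 1 (count 1) in topic 1.
toy_docs = [{0: 2, 1: 1}]
toy_mu = [{0: [1.0, 0.0], 1: [0.0, 1.0]}]
theta, phi = estimate_parameters(toy_docs, toy_mu, W=2, J=2, alpha=1.0, beta=1.0)
```

With these toy inputs the hyperparameters act as pseudo-counts: the document has smoothed topic counts 3 and 2, giving proportions 0.6 and 0.4.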



Alternatively, we can also derive the parameter estimation equations using the EM algorithm [14]. In the E-step, we calculate the marginal posterior probability $\mu(z_{w,d})$. Employing the multinomial-Dirichlet conjugacy and Bayes' rule, we get the following marginal Dirichlet distributions [24],

$$p(\theta_d) = \mathrm{Dir}(\theta_d \mid \mu(z_{\cdot,d}) + \alpha), \tag{13}$$

$$p(\phi_w) = \mathrm{Dir}(\phi_w \mid \mu(z_{w,\cdot}) + \beta). \tag{14}$$

In the M-step, maximizing (13) and (14) with respect to $\theta_d$ and $\phi_w$ also results in the same parameter estimation equations (11) and (12).

3.3 Discussion 

LDA is a hierarchical Bayesian model that maximizes the objective $p(\mathbf{w}, \mathbf{z})$ to generate topic labels for words, while MRF is an undirected model that maximizes the objective $p(\mathbf{z}|\mathbf{w})$ to assign the best topic labels to words. Their objectives are identical according to Bayes' rule, $p(\mathbf{z}|\mathbf{w}) \propto p(\mathbf{w}, \mathbf{z})$, since $p(\mathbf{w})$ is a constant in terms of $\mathbf{z}$. The collapsed Gibbs sampling (GS) [23] and variational Bayes (VB) [1] have been two commonly used approximate inference algorithms for LDA-based topic models. In this paper, we provide an alternative inference method for LDA-based topic models using the BP algorithm from the novel MRF perspective.
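The equivalence of the two objectives can be written out in one step. The bold symbols for the label and word sets are a notational assumption of this worked equation.

```latex
\begin{align}
\arg\max_{\mathbf{z}} \, p(\mathbf{z} \mid \mathbf{w})
  &= \arg\max_{\mathbf{z}} \, \frac{p(\mathbf{w}, \mathbf{z})}{p(\mathbf{w})} \\
  &= \arg\max_{\mathbf{z}} \, p(\mathbf{w}, \mathbf{z}),
\end{align}
```

The second line holds because the evidence $p(\mathbf{w})$ does not depend on $\mathbf{z}$, so dividing by it cannot change which labeling configuration attains the maximum.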



GS resembles the proposed BP except that it randomly samples a topic label from the marginal posterior probability $\mu(z_{w,d})$ for each word token, and immediately updates parameters based on the currently sampled topic label. Therefore, GS needs to sample a topic label for each word token in the document, while BP only calculates the message for each word index in the vocabulary within the document. Due to the word sparsity in the document, BP has a significantly lower computational cost than GS. In addition, the randomly sampled topic label in GS always loses some information compared with the marginal posterior probability in BP. As a result, BP is more accurate than GS in parameter estimation because it keeps and uses the complete messages at each learning iteration without loss of information.

VB uses Jensen's inequality to get an adjustable lower bound on the objective function, and maximizes the objective through maximizing the lower bound by tuning a set of variational parameters. VB also resembles the proposed BP except that it calculates the topic messages by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior distribution. Thus, the variational message update equations in VB differ significantly from those in BP by involving the more complicated digamma functions.

For each learning iteration, both BP and VB have the same computational cost $O(JDW_d)$, but GS requires $O(JDI_d)$, where $W_d$ is the average number of distinct word indices per document and $I_d$ is the average number of word tokens per document. Because the number of word indices in a document is usually much smaller than the total number of word tokens due to the word sparsity, i.e., $W_d \ll I_d$, BP and VB generally scale much better than GS to large-scale corpora. More detailed comparisons among VB, GS and BP can be found in [32].
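The gap between the two per-iteration costs can be made concrete on a toy corpus. The sketch below counts one unit of work per topic and per distinct word index (BP and VB) or per word token (GS); the function name and the token-list input format are illustrative assumptions, and constant factors are ignored.

```python
def per_iteration_cost(docs_tokens, J):
    """Return rough operation counts for one learning iteration.

    docs_tokens: list of documents, each a list of word tokens.
    """
    tokens = sum(len(doc) for doc in docs_tokens)         # D * I_d in total
    distinct = sum(len(set(doc)) for doc in docs_tokens)  # D * W_d in total
    return {"bp_vb": J * distinct, "gs": J * tokens}


toy_corpus = [["a", "a", "b", "a"], ["c", "c", "c"]]
cost = per_iteration_cost(toy_corpus, J=10)
```

Here the corpus has 7 tokens but only 3 distinct document-word pairs, so the message-based update touches fewer than half as many entries as a token-level sampler.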

4 Tag-Topic Models 

4.1 Factor Hypergraph Representation 

Fig. 4A shows the factor hypergraph representation of TTM, which directly combines the factor graph in Fig. 3A with the bipartite graph in Fig. 1C. Note that the undirected hypergraph is equivalent to the bipartite graph in Fig. 1C [15], where the factor hyperedge $\delta_d$ (denoted by the yellow block) connects the tag factors $\gamma_t$ attached to the document $d$ in Fig. 4C. In this factor hypergraph representation, we absorb the observed tag $t$ as the index of the factor $\gamma_t$, which connects the variable $z_{w,d}$ with its neighbors $z_{\cdot,d'}$ using the solid black line as shown in Fig. 4B. We assume that the document pair $\{d, d'\}$ shares the same tag $t$, the document pair $\{d, d''\}$ shares the same tag $t'$, and the document $d$ is associated with three tags $\{t, t', t''\}$. Although Fig. 4A does not follow the standard definition of factor graphs due to the factor hyperedge $\delta_d$, this variant of factor graph can represent both pairwise and higher-order relations among tagged documents and images as shown



Fig. 4. (A) Factor hypergraph for tag-topic models, where three tags are illustrated for simplicity. (B) Pairwise relation modeling: passing messages through the factor $\gamma_t$. (C) Higher-order relation modeling: passing messages through the factor hyperedge $\delta_d$ denoted by the yellow block. Arrows show the directions of message passing.



in Fig. 4B and Fig. 4C. For example, the topic labeling configurations $z_{\cdot,d'}$ or $z_{\cdot,d''}$ can influence the neighboring label $z_{w,d}$ separately through the factor $\gamma_t$ or $\gamma_{t'}$, based on the pairwise relation in Fig. 4B. Meanwhile, $\{z_{\cdot,d'}, z_{\cdot,d''}\}$ can also influence their neighbor $z_{w,d}$ jointly through the factor hyperedge $\delta_d$ based on the higher-order relation resulting from the connected tag factors $\{\gamma_t, \gamma_{t'}\}$ in Fig. 4C. As a result, the factor function $f_{\gamma_t}$ encodes the pairwise relation between $\{z_{\cdot,d'}, z_{w,d}\}$, while the factor function $f_{\delta_d}$ depicts the higher-order relation among $\{z_{\cdot,d'}, z_{\cdot,d''}, z_{w,d}\}$.



4.2 Credit Attribution 

Each attached semantic tag usually accounts for parts of words in documents or local contents in images. The credit attribution task is to associate individual words in a document with their most appropriate tags [21]. In the probabilistic framework, we assume that all tags in the document associate the same word with different likelihoods, which can be calculated based on the pairwise relation formed by each tag $\gamma_t$ as shown in Fig. 4B. Specifically, if $x_{w,d}$ is the tag label associated with the word $w$ in the document $d$, we calculate the likelihood $p(x_{w,d} = t)$ based on the following similarity in terms of topic messages,

$$p(x_{w,d} = t) = \sum_{j=1}^{J} \mu(z_{w,d} = j)\, \mu_{\gamma_t \to z_{w,d}}(z_{w,d} = j), \tag{15}$$

where the message $\mu_{\gamma_t \to z_{w,d}}(z_{w,d} = j)$ from the factor $\gamma_t$ will be introduced in the next subsections. Intuitively, Eq. (15) measures the similarity between the tag and the word content in the latent topic space. In practice, we randomly initialize $p(x_{w,d} = t)$ for all tags per word, and iteratively update and normalize it using (15).
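The credit attribution score can be sketched as an inner product between the word's topic message and the message arriving from each tag factor, followed by normalization over the tags of the document. In the sketch below the tag messages are taken as given inputs, and the function name, tag names, and numerical values are hypothetical.

```python
def credit_attribution(mu_wd, tag_messages):
    """Return p(x_{w,d} = t) for every tag t attached to the document.

    mu_wd:        length-J topic message of the word w in document d.
    tag_messages: dict mapping tag -> length-J message from that tag's factor.
    """
    scores = {
        tag: sum(m * g for m, g in zip(mu_wd, message))
        for tag, message in tag_messages.items()
    }
    total = sum(scores.values())
    # Normalize over the tags so that the likelihoods sum to one.
    return {tag: score / total for tag, score in scores.items()}


# A word whose message favors topic 0, and two tags with different topic profiles.
p_tag = credit_attribution(
    [0.9, 0.1],
    {"sky": [0.8, 0.2], "people": [0.1, 0.9]},
)
```

The word is credited mostly to the tag whose topic profile matches its own message, here `sky`.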



4.3 Pairwise and Higher-order Relation Modeling 

In Fig. 4B, the factor function $f_{\gamma_t}$ describes the pairwise topic dependencies between all
pairs of documents connected with the tag $t$,

$$f_{\gamma_t} = \frac{\sum_{d,d' \in ne(t)} \mu_{d,t} \circ \mu_{d',t}}{|d,d' \in ne(t)|}, \tag{16}$$

where the $\circ$ notation denotes the Hadamard (element-wise) product of two vectors, the $ne(t)$
notation denotes the set of all documents connected with the tag $t$, the $|d,d' \in ne(t)|$ notation
indicates the total number of document pairs connected with the tag $t$, and

$$\mu_{d,t} = \frac{\sum_{w} \mu(z_{w,d})\, p(x_{w,d} = t)}{\sum_{w} p(x_{w,d} = t)}, \tag{17}$$

where $p(x_{w,d} = t)$ is defined in (15). The Hadamard product captures the similarity between two
documents connected with the tag $t$ in the latent topic representations. As a result, Eq. (16) is
the average Hadamard product of all pairs of documents connected through the tag factor $\gamma_t$,
which encodes the dominant pairwise topic structural dependencies. Eq. (17) is the weighted sum of
all word messages in the document $d$ with respect to the tag $t$, which can be viewed as the
normalized message passed from all words in the document $d$ to the tag $t$.
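The two quantities above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `doc_tag_message` and `pairwise_factor` are hypothetical, Eq. (17) is read as a weighted average normalized by the summed tag likelihoods, and Eq. (16) is taken over unordered pairs of distinct documents.

```python
from itertools import combinations
from typing import List

Vector = List[float]


def doc_tag_message(word_msgs: List[Vector], p_tag: List[float]) -> Vector:
    """Eq. (17): average of word messages mu(z_{w,d}), weighted by p(x_{w,d} = t)."""
    total = sum(p_tag)
    n_topics = len(word_msgs[0])
    return [
        sum(p * msg[j] for p, msg in zip(p_tag, word_msgs)) / total
        for j in range(n_topics)
    ]


def pairwise_factor(doc_msgs: List[Vector]) -> Vector:
    """Eq. (16): mean Hadamard product over all pairs of documents sharing tag t."""
    pairs = list(combinations(doc_msgs, 2))  # assumption: unordered, distinct pairs
    if not pairs:
        raise ValueError("at least two documents must share the tag")
    n_topics = len(doc_msgs[0])
    return [sum(a[j] * b[j] for a, b in pairs) / len(pairs) for j in range(n_topics)]


# Three toy documents connected with the same tag, J = 2 topics.
mu_1 = doc_tag_message([[0.8, 0.2], [0.6, 0.4]], [0.5, 0.5])
mu_2 = doc_tag_message([[0.9, 0.1]], [1.0])
mu_3 = doc_tag_message([[0.2, 0.8]], [1.0])
f_gamma = pairwise_factor([mu_1, mu_2, mu_3])
```

Topics on which many connected document pairs agree receive large entries in `f_gamma`, which is how the factor favors topic labelings shared across a tag.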

In Fig. 4C, the factor function $f_{\delta_d}$ depicts the higher-order topic dependencies among
documents and images through connected tags $\{t, t', t''\} \in ne(d)$. Generally, modeling the
3-order and 4-order relations is sufficient in practice because most documents and images often
contain fewer than four tags, as shown in Table 2. Without loss of generality, here we present the
3-order relation modeling; relations of order higher than three can be modeled similarly. We design
$f_{\delta_d}$ for the 3-order relation based on the Hadamard product as follows,

$$f_{\delta_d} = \frac{\sum_{d,d',d'' \in ne(t,t',t'')} \mu_{d,t} \circ \mu_{d',t'} \circ \mu_{d'',t''}}{|d,d',d'' \in ne(t,t',t'')|}, \tag{18}$$

where the $|d,d',d'' \in ne(t,t',t'')|$ notation denotes the total number of 3-order document or
image triples constituted by the connected tags $t$, $t'$ and $t''$, respectively. Eq. (18) is the
average Hadamard product of all triples of documents or images, capturing the major and
representative 3-order topic structural dependencies among tagged documents and images.
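A small sketch of the 3-order factor in Eq. (18) follows. It assumes the caller has already enumerated the triples in $ne(t,t',t'')$ and supplies the three document-to-tag messages for each one; the names `hadamard` and `third_order_factor` are illustrative only.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def hadamard(*vectors: Sequence[float]) -> Vector:
    """Element-wise product of any number of equal-length vectors."""
    out = [1.0] * len(vectors[0])
    for vec in vectors:
        out = [o * x for o, x in zip(out, vec)]
    return out


def third_order_factor(triples: List[Tuple[Vector, Vector, Vector]]) -> Vector:
    """Eq. (18): mean of mu_{d,t} o mu_{d',t'} o mu_{d'',t''} over the given triples."""
    if not triples:
        raise ValueError("at least one triple is required")
    total = [0.0] * len(triples[0][0])
    for a, b, c in triples:
        total = [s + p for s, p in zip(total, hadamard(a, b, c))]
    return [s / len(triples) for s in total]


# Two toy triples of document-to-tag messages, J = 2 topics.
triples = [
    ([0.7, 0.3], [0.9, 0.1], [0.8, 0.2]),
    ([0.6, 0.4], [0.5, 0.5], [0.2, 0.8]),
]
f_delta = third_order_factor(triples)
```

Because three probabilities are multiplied per topic, entries shrink quickly unless all three documents agree, which is what makes the factor selective for joint agreement.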

4.4 Inference and Parameter Estimation 

TTM in Fig. 4A contains loops, so we develop the loopy BP algorithm [14], [16] for approximate
inference and parameter estimation. In subsection 3.2 we have calculated the messages
$\mu_{\theta_d \to z_{w,d}}$ and $\mu_{\phi_w \to z_{w,d}}$ in (6) and (7). In this subsection, we focus
on computing the messages $\mu_{\gamma_t \to z_{w,d}}$ and $\mu_{\delta_d \to z_{w,d}}$ based on the
sum-product algorithm, which involve not only pairwise relations but also higher-order relations
among documents in Figs. 4B and 4C, respectively. Under the topic smoothness constraint, the message
from the factor $\gamma_t$ to the variable $z_{w,d}$ is

$$\mu_{\gamma_t \to z_{w,d}} = f_{\gamma_t} \sum_{m \in ne(t) \setminus d} \mu_{m,t}, \tag{19}$$

where $m \in ne(t) \setminus d$ denotes all documents connected with the tag $t$ except the current
document $d$ in Fig. 4B. Similarly, the message from the factor hyperedge $\delta_d$ to the variable
$z_{w,d}$ is



$$\mu_{\delta_d \to z_{w,d}} = f_{\delta_d} \sum_{m,m' \in ne(t,t') \setminus d} \mu_{m,t} \circ \mu_{m',t'}, \tag{20}$$



where $m, m' \in ne(t,t') \setminus d$ contains all document pairs, except the current document $d$,
connected with the tags $\{t, t'\}$. Eq. (19) passes messages from the neighboring documents through
the individual factor $\gamma_t$, while (20) passes joint messages through the factor hyperedge
$\delta_d$, which connects multiple tag factors $\{\gamma_t, \gamma_{t'}\}$. Therefore, Eq. (19)
influences the word message $\mu(z_{w,d})$ through the pairwise relation across the individual tag
$t$, and (20) plays a similar role through the higher-order relation across multiple tags
$\{t, t'\}$. Similar to (6) and (7), we replace the product operation by the sum operation in (19)
and (20) for all neighboring input messages in order to avoid arithmetic underflow.
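To make the tag-to-word message concrete, here is a minimal sketch of the pairwise case in Eq. (19): sum the document-to-tag messages of all neighbors except the current document, then weight the result by the pairwise factor. Treating the product of the factor and the summed messages as element-wise is an assumption of this example, as are the function and argument names.

```python
from typing import Dict, List

Vector = List[float]


def tag_to_word_message(f_gamma: Vector,
                        doc_msgs: Dict[str, Vector],
                        current_doc: str) -> Vector:
    """Eq. (19): factor f_{gamma_t} times the summed messages of neighbors of d."""
    neighbor_sum = [0.0] * len(f_gamma)
    for doc_id, msg in doc_msgs.items():
        if doc_id == current_doc:
            continue  # exclude the current document d from ne(t)
        # Messages are summed rather than multiplied to avoid underflow.
        neighbor_sum = [s + x for s, x in zip(neighbor_sum, msg)]
    # Assumption: the factor is applied element-wise over topics.
    return [f * s for f, s in zip(f_gamma, neighbor_sum)]


# Toy example: three documents share the tag, J = 2 topics.
doc_msgs = {"d1": [0.7, 0.3], "d2": [0.9, 0.1], "d3": [0.2, 0.8]}
message = tag_to_word_message([0.3, 0.1], doc_msgs, current_doc="d1")
```

The higher-order message in Eq. (20) would follow the same pattern with the hyperedge factor and pairs of neighboring documents in place of single neighbors.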

TABLE 2
Summarization of four data sets

Data sets   $D$     $T$    $W$    $I_d$   $W_d$   $T_d$
CORA        2410    2480   2961   57      43      3.0
MED         2317    8906   8918   104     66      5.8
C5K         5000    371    128    11970   90      3.5
C30K        31695   1035   128    11970   105     3.6

In the standard sum-product algorithm, we calculate the marginal posterior probability
$\mu(z_{w,d})$ by the product of all input messages according to Fig. 4A. However, the direct product
is not flexible enough to balance the messages from factors $\theta_d$, $\gamma_t$ and $\delta_d$ in
Fig. 4A. Conceivably, the message $\mu_{\theta_d \to z_{w,d}}$ measures the topic labeling influence
within the document $d$, the message $\mu_{\gamma_t \to z_{w,d}}$ captures the influence from
neighboring documents by pairwise relations, and the message $\mu_{\delta_d \to z_{w,d}}$ plays a
similar role but through higher-order relations. Because these three messages are at the document
level, we balance them by a simple convex combination, and rewrite (1) as



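$$\mu(z_{w,d}) \propto \Big[ (1 - \omega_1 - \omega_2)\, \mu_{\theta_d \to z_{w,d}} + \omega_1 \sum_{t \in ne(d)} \mu_{\gamma_t \to z_{w,d}} + \omega_2\, \mu_{\delta_d \to z_{w,d}} \Big] \times \mu_{\phi_w \to z_{w,d}}, \tag{21}$$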



where $\omega_1 \ge 0$, $\omega_2 \ge 0$, $\omega_1 + \omega_2 \le 1$ are the weights that balance the
three messages from factors $\theta_d$, $\gamma_t$ and $\delta_d$. In (21), we sum the messages
$\mu_{\gamma_t \to z_{w,d}}$ over all individual tags $t$ attached to the document $d$, which
accumulates the influence from all attached tags. Eq. (21) shows that the current word message is
regularized by both pairwise and higher-order relations of tagged documents. When
$\omega_1 = \omega_2 = 0$, Eq. (21) reduces to (1), so that TTM becomes LDA without tag information.
When $\omega_2 = 0$, we depict only pairwise relations between tagged documents. Automatically
estimating the best weights in TTM requires further study in future work. In this paper, we manually
tune these weights based on the training data sets.
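The convex combination in Eq. (21) can be sketched as below. This is only an illustration: the function name `update_word_message`, the argument layout, and the explicit normalization step are assumptions, and the input messages are taken as already computed.

```python
from typing import List

Vector = List[float]


def update_word_message(mu_theta: Vector, mu_gammas: List[Vector], mu_delta: Vector,
                        mu_phi: Vector, w1: float, w2: float) -> Vector:
    """Eq. (21): convex mix of document-level messages, times the word-level message."""
    if w1 < 0 or w2 < 0 or w1 + w2 > 1:
        raise ValueError("weights must satisfy w1 >= 0, w2 >= 0, w1 + w2 <= 1")
    n_topics = len(mu_theta)
    # Accumulate the messages from every tag attached to the document.
    gamma_sum = [sum(g[j] for g in mu_gammas) for j in range(n_topics)]
    mixed = [
        (1.0 - w1 - w2) * mu_theta[j] + w1 * gamma_sum[j] + w2 * mu_delta[j]
        for j in range(n_topics)
    ]
    unnormalized = [m * p for m, p in zip(mixed, mu_phi)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


mu_theta, mu_phi = [0.5, 0.5], [0.8, 0.2]
mu_gammas, mu_delta = [[0.9, 0.1]], [0.6, 0.4]
# With both weights at zero the tag messages are ignored entirely.
lda_like = update_word_message(mu_theta, mu_gammas, mu_delta, mu_phi, 0.0, 0.0)
# With the weights reported later for TTM-H, the tags pull mass toward topic 0.
ttm_h = update_word_message(mu_theta, mu_gammas, mu_delta, mu_phi, 0.1, 0.05)
```

Setting `w2 = 0` while keeping `w1 > 0` gives the pairwise-only variant described in the text.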

The inference and parameter estimation equations for TTM are almost the same as those for LDA except
that the update equation of the message $\mu(z_{w,d})$ is replaced by (21). Fig. 5 summarizes the
loopy BP algorithm for learning TTM. At each learning iteration, we need to estimate both the
pairwise and higher-order topic structural dependencies using (16) and (18), so that the
computational cost of learning TTM is $O(JDWL)$, where $L$ is the total number of pairwise and
higher-order relations among tagged documents and images.
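As a hedged illustration of the parameter estimation step summarized in Fig. 5, the sketch below turns converged word messages into a document-topic distribution by adding the Dirichlet hyperparameter and normalizing over topics. It assumes that the aggregated message for a document is the count-weighted sum of its word messages; the name `estimate_theta` and its argument layout are hypothetical.

```python
from typing import List


def estimate_theta(word_msgs: List[List[float]], counts: List[int],
                   alpha: float) -> List[float]:
    """Document-topic proportions from smoothed, aggregated word messages.

    word_msgs[i][j] : message mu(z_{w,d} = j) for the i-th distinct word of document d
    counts[i]       : number of occurrences of that word in d (assumed weighting)
    alpha           : Dirichlet hyperparameter used as a smoothing constant
    """
    n_topics = len(word_msgs[0])
    smoothed = [
        sum(c * msg[j] for c, msg in zip(counts, word_msgs)) + alpha
        for j in range(n_topics)
    ]
    total = sum(smoothed)
    return [s / total for s in smoothed]


# One toy document with two distinct words, J = 2 topics.
theta_d = estimate_theta([[0.8, 0.2], [0.4, 0.6]], counts=[2, 1], alpha=0.1)
```

The topic-word distribution would be estimated analogously, aggregating over documents and smoothing with the second hyperparameter.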

5 Experimental Results 
5.1 Data Sets 

We use four data sets of tagged documents and images:

• CORA and MEDLINE (MED): The former contains abstracts from the Cora research paper search engine
  in the machine learning area, and the latter contains abstracts from the MEDLINE biomedical paper
  search engine. We use the author names as the tags for each paper. CORA documents can be
  classified into 7 major categories, and MED documents fall broadly into 4 categories.
• COREL5K (C5K) and COREL30K (C30K): They originate from the Corel stock photograph collection.
  They contain all kinds of images, ranging from natural scenes to portraits of people or sports
  photographs. Each image is associated with manually labeled tags that depict the main objects
  appearing in the picture. We use the colored pattern appearance model (CPAM) [36] to represent
  each image as a bag of visual words. A sliding window decomposes the image into 11970 visual
  words of 4×4 tiles, each of which is then mapped to one of 128 vocabulary indexes built by the
  CPAM from a large number of image patches using vector quantization.

Table 2 summarizes the statistics of the four data sets, where $D$ is the total number of documents,
$T$ is the total number of tags, $W$ is the vocabulary size, $I_d$ is the average number of words per
document, $W_d$ is the average vocabulary size per document, and $T_d$ is the average number of tags
per document.

input : $w, t, \omega_1, \omega_2, J, N, \alpha, \beta$.
output : $\theta_d, \phi_w$.
initialize: $\mu^1(z_{w,d} = j)$, $p^1(x_{w,d} = t)$ ← random initialization and normalization.
begin
    for $n$ ← 1 to $N$ do
        $\mu^{n+1}(z_{w,d}) \propto \big[(1 - \omega_1 - \omega_2)\, \mu^n_{\theta_d \to z_{w,d}} + \omega_1 \sum_{t \in ne(d)} \mu^n_{\gamma_t \to z_{w,d}} + \omega_2\, \mu^n_{\delta_d \to z_{w,d}}\big] \times \mu^n_{\phi_w \to z_{w,d}}$;
    end
    $\theta_d(j)$ ← $[\mu(z_{\cdot,d} = j) + \alpha] / \sum_j [\mu(z_{\cdot,d} = j) + \alpha]$;
    $\phi_w(j)$ ← $[\mu(z_{w,\cdot} = j) + \beta] / \sum_w [\mu(z_{w,\cdot} = j) + \beta]$;
end

Fig. 5. The loopy belief propagation for learning TTM.

5.2 Performance of Tag-Topic Models 

In the following experiments, we randomly divide the entire CORA and MED documents into training
(80%) and test (20%) sets. For C5K, we use the same training and test set partition as [35], in which
4500 images constitute the training set and the remaining 500 images constitute the test set. For
C30K, we randomly partition the entire image set into training (90%) and test (10%) sets. We manually
tune the weights $\omega_1$ and $\omega_2$ in (21) based on the perplexity [1] for training data. When
$\omega_1 = 0.2$, $\omega_2 = 0$, we refer to this TTM as TTM-P for only pairwise relation modeling.
When $\omega_1 = 0.1$, $\omega_2 = 0.05$, we refer to this TTM as TTM-H for both pairwise and
higher-order relation modeling. Through the comparative study between TTM-P and TTM-H, we can explore
the effectiveness of modeling higher-order topic interactions among tagged documents and images.

We compare TTM with three current state-of-the-art topic models, namely RTM (Fig. 2B) with the
exponential link probability function [4], ATM (Fig. 2C) [20] and L-LDA (Fig. 2D) [21], using the same
experimental settings. As discussed earlier, the above benchmark topic models are able to handle
pairwise relations between tagged documents. In contrast, TTM additionally considers higher-order
relations induced by connected tags among documents. Because L-LDA is a supervised topic model, we
only compare TTM with L-LDA in the tag recommendation task. In all experiments, we assume that the
tags are unobserved in test data, and use the estimated topic distributions from training data to
predict words, links, tags as well as class labels for documents in the test set.

1. http://cran.r-project.org/web/packages/lda/

2. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

3. http://nlp.stanford.edu/software/tmt/tmt-0.3/

5.2.1 Word Prediction

The word prediction task is to evaluate the likelihood that the learned topic distributions generate
the unseen test data. Fig. 6 compares the test set perplexity of ATM, RTM, TTM-P and TTM-H. A lower
perplexity corresponds to a higher likelihood that the learned topics can generate the unseen test
set. For all data sets, TTM-H consistently achieves the lowest perplexity across different numbers of
topics, which implies the best generalization ability to predict words in unseen test sets. Unlike
RTM, ATM does not explicitly model the pairwise topic representations between tagged documents and
images, so it does not sufficiently benefit from the rich relational information for regularizing
topic distributions. On the other hand, RTM estimates the link probability function for all document
pairs connected by different tags, while TTM-P estimates the tag-specific pairwise relations using
(16). As a result, TTM-P has the potential to capture the subtle topic structural dependencies
between documents or images through specific tags. Fig. 6 shows that TTM-P achieves almost 15%
reduction on average in perplexity compared with ATM and RTM. Furthermore, TTM-H gains on average a
5% reduction in perplexity compared with TTM-P, which indicates that joint influence through
higher-order relations may play a positive role in topic distribution regularization. Although TTM-H
has a higher computational complexity than TTM-P, the better word prediction performance is worth
the cost in real-world applications. Generally, the predictive perplexity decreases as the number of
topics increases, which implies that more latent topics provide a higher likelihood to predict words.

Fig. 6. Comparison of word prediction performance measured by the perplexity on test set.

ATM     knowledge system paper design reasoning problem theory approach case systems
RTM     system knowledge learning paper reasoning case approach planning design cases
TTM-P   design system reasoning case knowledge theory cases systems approach planning
TTM-H   knowledge system learning design reasoning case theory problem systems approach

ATM     genetic problem search algorithms programming paper results problems optimization evolutionary
RTM     problem algorithm learning performance results paper method algorithms genetic search
TTM-P   genetic problem search algorithms programming optimization evolutionary fitness population results
TTM-H   genetic search problem programming algorithms evolutionary fitness optimization population performance

ATM     network neural networks learning input training recurrent paper trained units
RTM     network neural models networks paper model statistics design university research
TTM-P   network neural networks learning input time recognition training algorithm recurrent
TTM-H   network neural networks learning weights parallel performance control systems architecture

Fig. 7. Each row shows one topic (top ten words) of ATM, RTM, TTM-P and TTM-H on the CORA training set.
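The perplexity measure used for word prediction is cited from [1] rather than restated here. The sketch below follows the usual definition, the exponential of the negative average log-likelihood per test token, and assumes each word probability is the mixture of topic-word probabilities weighted by the document-topic proportions. The data layout and the name `perplexity` are choices made for this example.

```python
import math
from typing import Dict, List


def perplexity(docs: List[Dict[int, int]],
               theta: List[List[float]],
               phi: List[List[float]]) -> float:
    """exp(-sum of log p(w | d) over test tokens / number of test tokens).

    docs[d]     : mapping from word index to its count in test document d
    theta[d][j] : topic proportion of topic j in document d
    phi[j][w]   : probability of word w under topic j
    """
    log_lik = 0.0
    n_tokens = 0
    for d, word_counts in enumerate(docs):
        for w, count in word_counts.items():
            p_w = sum(theta[d][j] * phi[j][w] for j in range(len(phi)))
            log_lik += count * math.log(p_w)
            n_tokens += count
    return math.exp(-log_lik / n_tokens)


# Sanity check: uniform topics over a 4-word vocabulary.
phi = [[0.25] * 4, [0.25] * 4]
theta = [[0.5, 0.5]]
docs = [{0: 2, 3: 1}]
ppl = perplexity(docs, theta, phi)
```

With uniform word probabilities the perplexity equals the vocabulary size, which gives a convenient upper reference point when comparing models.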

To measure the interpretability of a topic model, word intrusion and topic intrusion have been
proposed, both of which involve subjective judgements [37]. The basic idea is to ask volunteer
subjects to identify the number of word intruders in the topic as well as the topic intruders in the
document, where intruders are defined as inconsistent words or topics based on the prior knowledge of
the subjects. Due to the lack of volunteer subjects, Fig. 7 shows only three consistent topics with
their top ten words on the CORA training set for qualitative evaluation. We see that most topics
share similar words with different ranking orders. Nevertheless, in the first two topics both ATM and
RTM include the word intruder "paper", and RTM even extracts three word intruders, "design",
"research" and "university", in the third topic. Both TTM-P and TTM-H show better interpretability,
at least for the top ten words, which do not contain irrelevant common words such as "paper".
Moreover, TTM-H is slightly better than TTM-P in that it has a more natural word ranking order in
each topic.

5.2.2 Link Prediction 

The link prediction task is to predict whether two documents share the same tag. A natural real-world
application of link prediction is to suggest the tags of a document to a linking document. If the
tags are author names, we may use link prediction to find reviewers or collaborators for the linking
document. Link prediction can also help retrieve related documents with similar tags. The
effectiveness of these applications depends highly on the link prediction accuracy. We define link
prediction as a binary classification problem. We use the Hadamard product of a pair of document
topic proportions as the link feature, and train an SVM [38] to decide whether there is a link
between them. We evaluate link prediction performance using the same number of linking and
non-linking training and test samples.
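The link feature construction described above can be sketched as follows. The example builds the Hadamard-product feature for every document pair, labels a pair as linked when the two documents share at least one tag, and subsamples so that both classes have the same size. The SVM training step is left out, and the function names, the seed, and the subsampling strategy are assumptions of this sketch.

```python
import random
from itertools import combinations
from typing import Dict, List, Set, Tuple


def link_feature(theta_a: List[float], theta_b: List[float]) -> List[float]:
    """Hadamard product of two document topic proportions."""
    return [a * b for a, b in zip(theta_a, theta_b)]


def balanced_link_samples(theta: Dict[str, List[float]],
                          tags: Dict[str, Set[str]],
                          seed: int = 0) -> List[Tuple[List[float], int]]:
    """Return (feature, label) pairs with equally many links and non-links."""
    linked, unlinked = [], []
    for a, b in combinations(sorted(theta), 2):
        feature = link_feature(theta[a], theta[b])
        # A pair is a link when the two documents share at least one tag.
        (linked if tags[a] & tags[b] else unlinked).append(feature)
    n = min(len(linked), len(unlinked))
    rng = random.Random(seed)
    samples = [(f, 1) for f in rng.sample(linked, n)]
    samples += [(f, 0) for f in rng.sample(unlinked, n)]
    rng.shuffle(samples)
    return samples


theta = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.1, 0.9], "d4": [0.2, 0.8]}
tags = {"d1": {"a"}, "d2": {"a", "b"}, "d3": {"c"}, "d4": {"b"}}
samples = balanced_link_samples(theta, tags)
```

The resulting feature vectors and binary labels can then be passed to any SVM implementation for training and evaluation.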

Fig. 8 compares the F-measure of link prediction. Because ATM does not encode the Hadamard link
features of pairs of documents, its prediction results are close to random guessing, with an
F-measure near 0.5 for all data sets. In contrast, RTM shows significantly better link prediction
performance using the generalized linear models estimated from the link features, which efficiently
differentiate links from non-links. For all data sets, TTM-P deviates only slightly from RTM because
both TTM-P and RTM encode only pairwise relations of tagged documents. However, TTM-H outperforms
RTM by around 8% in F-measure for link prediction. One possible reason is that TTM-H incorporates
much richer higher-order topic structural dependencies, so that it makes the topic proportions of
documents sharing tags more distinguishable from those of documents that do not share tags.
Interestingly, the F-measure does not always increase as the number of topics increases. Although
more latent topics can predict unseen words better, as shown in Fig. 6, they cannot consistently
enhance the link prediction performance on the test set, as shown in Fig. 8. This phenomenon suggests
that the content similarity between documents alone cannot completely account for the link
information. Additional information, such as partially observed links of some documents, may help
achieve better link prediction performance.

Fig. 8. Comparison of link prediction performance based on the Hadamard product of document topic proportions.

5.2.3 Document Classification 

Document classification partitions a set of documents into several mutually exclusive categories.
Topic models can be used as a dimensionality reduction method to reduce the high-dimensional word
vector space for classification [1]. We may use the document topic proportions as the reduced feature
vectors and study their discriminative ability in document classification. To this end, we train SVMs
on the document topic proportions given class labels, and compare the document classification
accuracy on the test set. In CORA, we randomly select 100 training samples for each of the seven
categories. In MED, we randomly select 200 training samples for each of the five categories. In C5K
and C30K, we choose four tags as class labels: sky, water, trees, and people. We use those images
associated with only one of the four tags for training purposes. In C5K, we randomly select 300
training samples for each class. In C30K, we randomly select 1500 training samples for each class.
The remaining documents and images are test samples.

Fig. 9 shows the classification accuracy based on low-dimensional document topic proportions. We see
that ATM generally outperforms RTM, which is inconsistent with their word prediction performance in
Fig. 6. A possible reason is that RTM treats all shared tags as equivalent links, whereas in reality
different tags may encode different topic structural dependencies between documents. Thus, RTM may
erroneously encourage the topic smoothness of documents across different tags, even though tags
often correspond closely to the class labels of documents, especially when tags are used as class
labels for C5K and C30K. In contrast, TTM-P relaxes this limitation of RTM by encouraging the
smoothness of document topic proportions using tag-specific pairwise relation modeling. Furthermore,
TTM-H outperforms TTM-P with 6% higher classification accuracy on average by enforcing the
tag-specific smoothness constraint through both pairwise and higher-order relations. Image
classification performance on C5K and C30K is generally worse than that on CORA and MED, partly
because the tags tend to describe individual image components, which are not exactly equivalent to
class labels that describe the global image contents. Similar to the link prediction task, more
latent topics do not enhance the overall document classification performance.

5.2.4 Tag Recommendation 

Tag recommendation is a multi-label classification problem that suggests a set of tags for query
documents or images. It has found many real-world applications such as credit attribution [21],
expert finding [39] and image annotation [35], [40]. Due to the lack of benchmark data to evaluate
expert finding performance, we focus on tag recommendation for image annotation in this section. We
propose a TTM-based tag recommendation system including two SVMs:

1) Each tag is a class label. We train a multiclass SVM called $svm_1$ to classify the image topic
   proportions into $T$ tags, where the training samples are images associated with each tag. Some
   images may have more than one tag and will be used as training samples for multiple tags. For
   each training sample $d$, $svm_1$ predicts a vector of likelihoods
   $p^d = [p_1, \ldots, p_t, \ldots, p_T]$ for all tags.

2) We also train a total of $T$ binary SVMs called $svm_2$ for all tags. For the tag $t$, the
   positive sample is the tagged image $d$ with the feature vector $[p_t, p_{t' \in ne(t)}]$
   predicted by $svm_1$, where $ne(t)$ is the set of tags connected with the tag $t$. This feature
   encodes information of connected tags for robust prediction. To balance the training data for
   each tag, we choose the same number of positive and negative samples. For each training sample
   $d$, $svm_2$ predicts a vector of likelihoods $\mu^d_t = [\mu^d_{t,1}, \mu^d_{t,2}]$,
   $\sum_m \mu^d_{t,m} = 1$, $m = \{1, 2\}$, where $\mu^d_{t,1}$ is the likelihood that the tag $t$
   is recommended to $d$.

3) For the test image $d$, we use $svm_1$ to predict its likelihoods $p^d$ for all $T$ tags. Then,
   we use $svm_2$ to predict $\mu^d_{t,1}$, $1 \le t \le T$, for all tags. To balance the prediction
   results of $svm_1$ and $svm_2$, we linearly combine the two likelihoods
   $y = \omega p^d_t + (1 - \omega) \mu^d_{t,1}$ using the best mixture weight $\omega = 0.25$
   estimated from the training set. We follow the standard image annotation evaluation protocol
   [35], [40], and suggest the top five tags with the highest $y$ to the query image.

Fig. 9. Comparison of document classification accuracy based on document topic proportions.

In this system, $svm_1$ uses only image content information to suggest tags, while $svm_2$ uses
connected tags to refine the tag recommendation result. The basic idea is that if the tag $t$ is
suggested to the image $d$, its connected tags also have a high likelihood of being suggested.
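The final scoring step of the system can be sketched as follows, taking the two SVM outputs as given. The mixture weight of 0.25 and the top-five cutoff come from the text; the function name `recommend_tags`, the dictionary inputs, and the alphabetical tie-break are assumptions for illustration.

```python
from typing import Dict, List


def recommend_tags(p_svm1: Dict[str, float], mu_svm2: Dict[str, float],
                   weight: float = 0.25, top_k: int = 5) -> List[str]:
    """Combine both likelihoods linearly and return the top_k highest-scoring tags.

    p_svm1[t]  : likelihood of tag t from the multiclass SVM (content only)
    mu_svm2[t] : likelihood from the binary SVM of tag t (uses connected tags)
    """
    scores = {
        tag: weight * p_svm1[tag] + (1.0 - weight) * mu_svm2[tag]
        for tag in p_svm1
    }
    # Sort by descending score; ties are broken alphabetically (an assumption).
    return sorted(scores, key=lambda tag: (-scores[tag], tag))[:top_k]


p_svm1 = {"sky": 0.30, "water": 0.25, "trees": 0.20,
          "people": 0.10, "sun": 0.10, "car": 0.05}
mu_svm2 = {"sky": 0.9, "water": 0.4, "trees": 0.7,
           "people": 0.2, "sun": 0.6, "car": 0.1}
top_five = recommend_tags(p_svm1, mu_svm2)
```

With a weight of 0.25, the refined likelihoods from the second stage dominate the ranking, so a tag with modest content evidence can still be promoted when its connected tags are strongly supported.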

The performance measures for image tag recommendation include recall and precision rates per tag
[35], [40]. More specifically, for a given tag, let $N_h$ be the number of images in the test set
that are labeled with this tag by humans, $N_s$ be the number of images in the test set that are
labeled with this tag by the tag recommendation system, and $N_c$ be the number of images for which
the system gives the correct tag recommendation. The recall and precision rates are defined as
recall $= N_c / N_h$ and precision $= N_c / N_s$. We also evaluate the coverage rate Rate+ of
recommended tags, which is calculated as the number of tags with positive recall divided by the total
number of tags in the test set. A higher Rate+ implies a better generalization ability that can
achieve relatively high recall and precision rates on a large set of tags.

TABLE 3
Comparison of Tag Recommendation

C5K      Recall   Precision   Rate+
TTM-H    0.33     0.22        53.85%
TTM-P    0.30     0.21        49.29%
L-LDA    0.26     0.14        50.77%
SML      0.29     0.23        52.69%

C30K     Recall   Precision   Rate+
TTM-H    0.22     0.13        45.89%
TTM-P    0.20     0.11        42.00%
L-LDA    0.11     0.07        30.40%
SML      0.21     0.12        44.63%
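The per-tag recall, precision, and Rate+ measures defined above can be computed as in this sketch. It assumes that the set of tags in the test set is the set of tags appearing in the human labels, and that a ratio with a zero denominator counts as zero; the function name and the data layout are illustrative.

```python
from typing import Dict, List, Set, Tuple


def per_tag_metrics(human: List[Set[str]],
                    system: List[Set[str]]
                    ) -> Tuple[Dict[str, float], Dict[str, float], float]:
    """Return per-tag recall, per-tag precision, and the coverage rate Rate+."""
    tags = set().union(*human)  # assumption: tags present in the human labels
    recall, precision = {}, {}
    for tag in tags:
        n_h = sum(tag in h for h in human)    # images labeled with tag by humans
        n_s = sum(tag in s for s in system)   # images labeled with tag by the system
        n_c = sum(tag in h and tag in s for h, s in zip(human, system))
        recall[tag] = n_c / n_h if n_h else 0.0
        precision[tag] = n_c / n_s if n_s else 0.0
    # Rate+: fraction of tags that are recalled at least once.
    rate_plus = sum(r > 0 for r in recall.values()) / len(tags)
    return recall, precision, rate_plus


# Three toy test images with human labels and system recommendations.
human = [{"sky", "water"}, {"sky"}, {"trees"}]
system = [{"sky", "trees"}, {"sky", "water"}, {"water"}]
recall, precision, rate_plus = per_tag_metrics(human, system)
```

Averaging the per-tag recall and precision values over all tags would give summary figures of the kind reported in Table 3.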

Table 3 compares TTM-H and TTM-P with two state-of-the-art tag recommendation methods, L-LDA [21] and
SML [40]. With a similar coverage rate, TTM-H provides image annotation performance competitive with
SML. Although L-LDA shows comparable or better tag recommendation performance than SVM for tagged web
pages, it does not show clear advantages in the image annotation problem, especially on the C30K data
set. Indeed, L-LDA does not use the connected tag information from training data, which plays a major
role in ruling out many false positives and thus enhancing the average precision. We see that TTM-H
still outperforms TTM-P, which is consistent with its superior document classification performance
as shown in Fig. 9. Furthermore, more latent topics do not improve the tag recommendation
performance, so we show only the best results of TTM-H and TTM-P, obtained when the number of latent
topics is $J = 20$.

6 Conclusions 

This paper has presented TTM and discussed its effectiveness in encoding smooth pairwise and
higher-order topic interactions among tagged documents and images. Within the MRF framework, TTM
allows the efficient loopy BP algorithm for inference and parameter estimation. On four large-scale
data sets, TTM consistently outperforms current state-of-the-art topic models, such as ATM, RTM and
L-LDA, in several real-world text and image mining applications.

Furthermore, we observe that higher-order relations also exist in many important computer vision and
text mining applications. For example, unsupervised activity perception in crowded and complicated
scenes involves many higher-order interactions of multiple agents, which can be encoded in topic
models for discovering more specific motion patterns. Another example is tracking historical topics
from time-stamped documents. We speculate that higher-order temporal topic interactions may
characterize some specific long-range topic evolution patterns, which can also be studied in our
future work.